👋 Hello folks! Welcome to this exciting mini-project on Car Price Prediction!
In this project, we take a real-world dataset and walk through the entire Machine Learning pipeline, from data exploration to building regression models and improving the results.
We begin with EDA (Exploratory Data Analysis) to understand how different features like engine size, horsepower, and mileage affect car prices. Then, we dive into data preprocessing, cleaning the data and preparing it for modeling.
We first apply Multiple Linear Regression to make predictions. But we don't stop there! To improve the results, we then use Polynomial Regression, which can capture non-linear patterns that a purely linear model misses and which reported a lower RMSE in this notebook's runs.
The project is built on the Car Price dataset, and everything is done using Python and popular libraries like pandas, matplotlib, seaborn, and scikit-learn.
Hope you enjoy the journey! 🚗📊✨
# Import the required Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
Read the data using pandas
data = pd.read_csv("D:/NRIT Solutions/ML/Presentation/Linear Regression/Multi Linear Regression/Car Price Prediction Multiple Linear Regression/CarPrice_Assignment.csv")
data.head()
| | car_ID | symboling | CarName | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | ... | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | alfa-romero giulia | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.0 |
| 1 | 2 | 3 | alfa-romero stelvio | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.0 |
| 2 | 3 | 1 | alfa-romero Quadrifoglio | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.0 |
| 3 | 4 | 2 | audi 100 ls | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950.0 |
| 4 | 5 | 2 | audi 100ls | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450.0 |
5 rows × 26 columns
EDA
data_EDA = data.copy()
data_EDA.head()
| | car_ID | symboling | CarName | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | ... | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | alfa-romero giulia | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.0 |
| 1 | 2 | 3 | alfa-romero stelvio | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.0 |
| 2 | 3 | 1 | alfa-romero Quadrifoglio | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.0 |
| 3 | 4 | 2 | audi 100 ls | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950.0 |
| 4 | 5 | 2 | audi 100ls | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450.0 |
5 rows × 26 columns
sns.set_style("darkgrid")
sns.histplot(data=data_EDA, x = 'price',kde=True,hue='fueltype',multiple='stack')
plt.ylabel("Count of Car Sales")
plt.xlabel("Price")
plt.title("Price vs Fueltype")
Text(0.5, 1.0, 'Price vs Fueltype')
sns.set_style("darkgrid")
sns.histplot(data=data_EDA, x = 'price',kde=True,hue='aspiration',multiple='stack')
#plt.ylabel("Count of Car Sales")
#plt.xlabel("Price")
#plt.title("Price vs Fueltype")
<Axes: xlabel='price', ylabel='Count'>
sns.set_style("darkgrid")
sns.histplot(data=data_EDA, x = 'price',kde=True,hue='doornumber',multiple='stack')
#plt.ylabel("Count of Car Sales")
#plt.xlabel("Price")
#plt.title("Price vs Fueltype")
<Axes: xlabel='price', ylabel='Count'>
sns.set_style("darkgrid")
sns.histplot(data=data_EDA, x = 'price',kde=True,hue='carbody',multiple='stack')
#plt.ylabel("Count of Car Sales")
#plt.xlabel("Price")
#plt.title("Price vs Fueltype")
<Axes: xlabel='price', ylabel='Count'>
sns.set_style("darkgrid")
sns.histplot(data=data_EDA, x = 'price',kde=True,hue='drivewheel',multiple='stack')
#plt.ylabel("Count of Car Sales")
#plt.xlabel("Price")
#plt.title("Price vs Fueltype")
<Axes: xlabel='price', ylabel='Count'>
sns.set_style("darkgrid")
sns.histplot(data=data_EDA, x = 'price',kde=True,hue='enginelocation',multiple='stack')
#plt.ylabel("Count of Car Sales")
#plt.xlabel("Price")
#plt.title("Price vs Fueltype")
<Axes: xlabel='price', ylabel='Count'>
sns.set_style("darkgrid")
sns.histplot(data=data_EDA, x = 'price',kde=True,hue='enginetype',multiple='stack')
#plt.ylabel("Count of Car Sales")
#plt.xlabel("Price")
#plt.title("Price vs Fueltype")
<Axes: xlabel='price', ylabel='Count'>
sns.set_style("darkgrid")
sns.histplot(data=data_EDA, x = 'price',kde=True,hue='cylindernumber',multiple='stack')
#plt.ylabel("Count of Car Sales")
#plt.xlabel("Price")
#plt.title("Price vs Fueltype")
<Axes: xlabel='price', ylabel='Count'>
sns.set_style("darkgrid")
sns.histplot(data=data_EDA, x = 'price',kde=True,hue='fuelsystem',multiple='stack')
#plt.ylabel("Count of Car Sales")
#plt.xlabel("Price")
#plt.title("Price vs Fueltype")
<Axes: xlabel='price', ylabel='Count'>
#boxplot()
sns.boxplot(data=data_EDA,x='price',hue='aspiration')
<Axes: xlabel='price'>
# Convert categorical columns to integer codes
from sklearn.preprocessing import LabelEncoder
lab_obj = LabelEncoder()
data_EDA["fueltype"] = lab_obj.fit_transform(data_EDA["fueltype"])
data_EDA["aspiration"] = lab_obj.fit_transform(data_EDA["aspiration"])
data_EDA["doornumber"] = lab_obj.fit_transform(data_EDA["doornumber"])
data_EDA["carbody"] = lab_obj.fit_transform(data_EDA["carbody"])
data_EDA["drivewheel"] = lab_obj.fit_transform(data_EDA["drivewheel"])
data_EDA["enginelocation"] = lab_obj.fit_transform(data_EDA["enginelocation"])
data_EDA["enginetype"] = lab_obj.fit_transform(data_EDA["enginetype"])
data_EDA["cylindernumber"] = lab_obj.fit_transform(data_EDA["cylindernumber"])
data_EDA["fuelsystem"] = lab_obj.fit_transform(data_EDA["fuelsystem"])
#Drop unwanted columns
data_EDA = data_EDA.drop("car_ID",axis=1)
data_EDA = data_EDA.drop("CarName",axis=1)
data_EDA.head()
| | symboling | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | carlength | carwidth | ... | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 0 | 1 | 0 | 2 | 0 | 88.6 | 168.8 | 64.1 | ... | 130 | 5 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.0 |
| 1 | 3 | 1 | 0 | 1 | 0 | 2 | 0 | 88.6 | 168.8 | 64.1 | ... | 130 | 5 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.0 |
| 2 | 1 | 1 | 0 | 1 | 2 | 2 | 0 | 94.5 | 171.2 | 65.5 | ... | 152 | 5 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.0 |
| 3 | 2 | 1 | 0 | 0 | 3 | 1 | 0 | 99.8 | 176.6 | 66.2 | ... | 109 | 5 | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950.0 |
| 4 | 2 | 1 | 0 | 0 | 3 | 0 | 0 | 99.4 | 176.6 | 66.4 | ... | 136 | 5 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450.0 |
5 rows × 24 columns
import matplotlib.pyplot as plt
import seaborn as sns
# Compute correlation matrix
correlation_matrix = data_EDA.corr()
# Set up the matplotlib figure
plt.figure(figsize=(14, 10)) # Increased size
# Create the heatmap
sns.heatmap(
    correlation_matrix,
    annot=True,             # Show correlation numbers
    fmt=".2f",              # Format as decimal
    cmap="coolwarm",        # Color map
    linewidths=0.5,         # Line between boxes
    annot_kws={"size": 8}   # Smaller font size inside boxes
)
# Rotate axis labels for better readability
plt.xticks(rotation=45, ha='right', fontsize=9)
plt.yticks(rotation=0, fontsize=9)
# Title
plt.title("Correlation Heatmap of Numerical Features", fontsize=14)
# Show plot
plt.tight_layout()
plt.show()
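With 24 columns the heatmap is dense, so it can help to pull out just the `price` column of the correlation matrix and rank the features by absolute correlation. The sketch below does that on a small stand-in frame; the rows are illustrative values, not the full dataset, and the variable names are assumptions for this example only.

```python
import pandas as pd

# Toy stand-in for data_EDA (illustrative rows, not the real dataset file).
toy = pd.DataFrame({
    "enginesize": [130, 152, 109, 136, 97],
    "horsepower": [111, 154, 102, 115, 69],
    "citympg": [21, 19, 24, 18, 31],
    "price": [13495.0, 16500.0, 13950.0, 17450.0, 7000.0],
})

# Correlation of every feature with price, excluding price itself.
corr_with_price = toy.corr()["price"].drop("price")

# Order the features by the strength (absolute value) of the correlation,
# but keep the sign so negative relationships stay visible.
order = corr_with_price.abs().sort_values(ascending=False).index
ranked = corr_with_price.reindex(order)
print(ranked)
```

On the real data the same three lines would be applied to `data_EDA` in place of `toy`.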
#pairplot()
sns.pairplot(data_EDA)
<seaborn.axisgrid.PairGrid at 0x218e010a960>
#pairplot()
sns.pairplot(data_EDA,hue='fueltype')
<seaborn.axisgrid.PairGrid at 0x218fcc17350>
Data Preprocessing
# Checking for NaN values
data.isnull().sum()
car_ID              0
symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
dtype: int64
Our data has no NaN values. If it did, we would need to apply a missing-value imputation technique before modeling.
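For reference, here is a minimal sketch of what simple imputation could look like if NaN values did appear. It uses a toy frame with gaps added on purpose; filling numeric gaps with the median and categorical gaps with the most frequent value is one common choice, assumed here for illustration rather than taken from this project.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries (not from the real dataset).
toy = pd.DataFrame({
    "horsepower": [111.0, np.nan, 102.0, 115.0],
    "fueltype": ["gas", "gas", None, "diesel"],
})

# Numeric column: fill gaps with the column median.
toy["horsepower"] = toy["horsepower"].fillna(toy["horsepower"].median())

# Categorical column: fill gaps with the most frequent value.
toy["fueltype"] = toy["fueltype"].fillna(toy["fueltype"].mode()[0])

print(toy.isnull().sum())
```

In a train/test workflow the fill values would normally be computed from the training rows only.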
## Basic statistics for our data
data.describe()
| | car_ID | symboling | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | carlength | ... | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | ... | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 |
| mean | 103.000000 | 0.834146 | 0.902439 | 0.180488 | 0.439024 | 2.614634 | 1.326829 | 0.014634 | 98.756585 | 174.049268 | ... | 126.907317 | 3.253659 | 3.329756 | 3.255415 | 10.142537 | 104.117073 | 5125.121951 | 25.219512 | 30.751220 | 13276.710571 |
| std | 59.322565 | 1.245307 | 0.297446 | 0.385535 | 0.497483 | 0.859081 | 0.556171 | 0.120377 | 6.021776 | 12.337289 | ... | 41.642693 | 2.013204 | 0.270844 | 0.313597 | 3.972040 | 39.544167 | 476.985643 | 6.542142 | 6.886443 | 7988.852332 |
| min | 1.000000 | -2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 86.600000 | 141.100000 | ... | 61.000000 | 0.000000 | 2.540000 | 2.070000 | 7.000000 | 48.000000 | 4150.000000 | 13.000000 | 16.000000 | 5118.000000 |
| 25% | 52.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 2.000000 | 1.000000 | 0.000000 | 94.500000 | 166.300000 | ... | 97.000000 | 1.000000 | 3.150000 | 3.110000 | 8.600000 | 70.000000 | 4800.000000 | 19.000000 | 25.000000 | 7788.000000 |
| 50% | 103.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 3.000000 | 1.000000 | 0.000000 | 97.000000 | 173.200000 | ... | 120.000000 | 5.000000 | 3.310000 | 3.290000 | 9.000000 | 95.000000 | 5200.000000 | 24.000000 | 30.000000 | 10295.000000 |
| 75% | 154.000000 | 2.000000 | 1.000000 | 0.000000 | 1.000000 | 3.000000 | 2.000000 | 0.000000 | 102.400000 | 183.100000 | ... | 141.000000 | 5.000000 | 3.580000 | 3.410000 | 9.400000 | 116.000000 | 5500.000000 | 30.000000 | 34.000000 | 16503.000000 |
| max | 205.000000 | 3.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 2.000000 | 1.000000 | 120.900000 | 208.100000 | ... | 326.000000 | 7.000000 | 3.940000 | 4.170000 | 23.000000 | 288.000000 | 6600.000000 | 49.000000 | 54.000000 | 45400.000000 |
8 rows × 25 columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   car_ID            205 non-null    int64
 1   symboling         205 non-null    int64
 2   CarName           205 non-null    object
 3   fueltype          205 non-null    int64
 4   aspiration        205 non-null    int64
 5   doornumber        205 non-null    int64
 6   carbody           205 non-null    int64
 7   drivewheel        205 non-null    int64
 8   enginelocation    205 non-null    int64
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64
 14  enginetype        205 non-null    int64
 15  cylindernumber    205 non-null    int64
 16  enginesize        205 non-null    int64
 17  fuelsystem        205 non-null    int64
 18  boreratio         205 non-null    float64
 19  stroke            205 non-null    float64
 20  compressionratio  205 non-null    float64
 21  horsepower        205 non-null    int64
 22  peakrpm           205 non-null    int64
 23  citympg           205 non-null    int64
 24  highwaympg        205 non-null    int64
 25  price             205 non-null    float64
dtypes: float64(8), int64(17), object(1)
memory usage: 41.8+ KB
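Several columns (fueltype, aspiration, doornumber, carbody, drivewheel, enginelocation, enginetype, cylindernumber, fuelsystem) hold categorical data, so we need to convert them to numeric values with a feature-encoding technique.
# Convert categorical columns to integer codes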
from sklearn.preprocessing import LabelEncoder
lab_obj = LabelEncoder()
data["fueltype"] = lab_obj.fit_transform(data["fueltype"])
data["aspiration"] = lab_obj.fit_transform(data["aspiration"])
data["doornumber"] = lab_obj.fit_transform(data["doornumber"])
data["carbody"] = lab_obj.fit_transform(data["carbody"])
data["drivewheel"] = lab_obj.fit_transform(data["drivewheel"])
data["enginelocation"] = lab_obj.fit_transform(data["enginelocation"])
data["enginetype"] = lab_obj.fit_transform(data["enginetype"])
data["cylindernumber"] = lab_obj.fit_transform(data["cylindernumber"])
data["fuelsystem"] = lab_obj.fit_transform(data["fuelsystem"])
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   car_ID            205 non-null    int64
 1   symboling         205 non-null    int64
 2   CarName           205 non-null    object
 3   fueltype          205 non-null    int64
 4   aspiration        205 non-null    int64
 5   doornumber        205 non-null    int64
 6   carbody           205 non-null    int64
 7   drivewheel        205 non-null    int64
 8   enginelocation    205 non-null    int64
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64
 14  enginetype        205 non-null    int64
 15  cylindernumber    205 non-null    int64
 16  enginesize        205 non-null    int64
 17  fuelsystem        205 non-null    int64
 18  boreratio         205 non-null    float64
 19  stroke            205 non-null    float64
 20  compressionratio  205 non-null    float64
 21  horsepower        205 non-null    int64
 22  peakrpm           205 non-null    int64
 23  citympg           205 non-null    int64
 24  highwaympg        205 non-null    int64
 25  price             205 non-null    float64
dtypes: float64(8), int64(17), object(1)
memory usage: 41.8+ KB
data.head()
| | car_ID | symboling | CarName | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | ... | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | alfa-romero giulia | 1 | 0 | 1 | 0 | 2 | 0 | 88.6 | ... | 130 | 5 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.0 |
| 1 | 2 | 3 | alfa-romero stelvio | 1 | 0 | 1 | 0 | 2 | 0 | 88.6 | ... | 130 | 5 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.0 |
| 2 | 3 | 1 | alfa-romero Quadrifoglio | 1 | 0 | 1 | 2 | 2 | 0 | 94.5 | ... | 152 | 5 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.0 |
| 3 | 4 | 2 | audi 100 ls | 1 | 0 | 0 | 3 | 1 | 0 | 99.8 | ... | 109 | 5 | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950.0 |
| 4 | 5 | 2 | audi 100ls | 1 | 0 | 0 | 3 | 0 | 0 | 99.4 | ... | 136 | 5 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450.0 |
5 rows × 26 columns
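The nine `fit_transform` calls above reuse a single `LabelEncoder`, so after each call the encoder only remembers the classes of the last column it saw. A possible alternative, sketched here on a toy frame, is to loop over the columns and keep one encoder per column so the integer codes can be mapped back to labels later. The `encoders` dictionary and the toy rows are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the car data (illustrative rows only).
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas", "gas"],
    "carbody": ["sedan", "hatchback", "sedan", "wagon"],
    "price": [13495.0, 16500.0, 13950.0, 17450.0],
})

categorical_cols = ["fueltype", "carbody"]
encoders = {}  # hypothetical helper: column name -> fitted encoder

for col in categorical_cols:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])  # classes are sorted, then numbered from 0
    encoders[col] = enc

# Each encoder keeps its own classes, so codes can be turned back into labels.
fuel_labels = encoders["fueltype"].inverse_transform(toy["fueltype"])
print(toy)
print(list(fuel_labels))
```

Note that integer codes impose an arbitrary order on the categories (for example hatchback < sedan < wagon), which a linear model will treat as a numeric scale; one-hot encoding is a common alternative when that is a concern.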
#Drop unwanted columns
data = data.drop("car_ID",axis=1)
data = data.drop("CarName",axis=1)
data.head()
| | symboling | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | carlength | carwidth | ... | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 0 | 1 | 0 | 2 | 0 | 88.6 | 168.8 | 64.1 | ... | 130 | 5 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.0 |
| 1 | 3 | 1 | 0 | 1 | 0 | 2 | 0 | 88.6 | 168.8 | 64.1 | ... | 130 | 5 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.0 |
| 2 | 1 | 1 | 0 | 1 | 2 | 2 | 0 | 94.5 | 171.2 | 65.5 | ... | 152 | 5 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.0 |
| 3 | 2 | 1 | 0 | 0 | 3 | 1 | 0 | 99.8 | 176.6 | 66.2 | ... | 109 | 5 | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950.0 |
| 4 | 2 | 1 | 0 | 0 | 3 | 0 | 0 | 99.4 | 176.6 | 66.4 | ... | 136 | 5 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450.0 |
5 rows × 24 columns
#Divide the features into independent and dependent variables in terms of X and Y
x = data.iloc[:, 0:-1]
y = data.iloc[:, [-1]]
#Feature Scaling
from sklearn.preprocessing import MinMaxScaler
x_scl = MinMaxScaler()
y_scl = MinMaxScaler()
x = x_scl.fit_transform(x)
y = y_scl.fit_transform(y)
# Splitting data into train and test sets
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
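The cells above scale `x` and `y` on all rows and split afterwards, which lets the test rows influence the scaler's minimum and maximum. A minimal sketch of the usual leak-free ordering, split first and then fit the scaler on the training rows only, is shown below on synthetic data; the array shapes and coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the car dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)).reshape(-1, 1)

# 1) Split first, so the test rows stay unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2) Learn the min/max from the training rows only.
x_scaler = MinMaxScaler()
X_train_s = x_scaler.fit_transform(X_train)

# 3) Reuse those training statistics on the test rows.
X_test_s = x_scaler.transform(X_test)

print(X_train_s.min(axis=0), X_train_s.max(axis=0))
```

With this ordering the scaled test values may fall slightly outside the 0 to 1 range, which is expected.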
Linear Model Building
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
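# Fit the model. Note: x and y here are the full data set, test rows included,
# so the scores below are optimistic; reg.fit(x_train, y_train) would keep x_test unseen.
reg.fit(x,y)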
LinearRegression()
reg.score(x,y)
0.8802434927425343
# Make predictions on the test set
y_predict = reg.predict(x_test)
from sklearn.metrics import root_mean_squared_error
RMSE = root_mean_squared_error(y_pred=y_predict,y_true=y_test)
RMSE
0.0843969173808874
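Because `reg` is fit on all rows, `x_test` is not truly held out in the cells above. A minimal sketch of a held-out evaluation follows: fit on the training split, then report R² on both splits and RMSE on the test split. It runs on synthetic data (shapes, coefficients, and noise level are assumptions), and it uses `np.sqrt(mean_squared_error(...))` so that it also works on scikit-learn versions that predate `root_mean_squared_error`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the car dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows only.
reg = LinearRegression()
reg.fit(X_train, y_train)

# Compare performance on seen and unseen rows.
train_r2 = reg.score(X_train, y_train)
test_r2 = reg.score(X_test, y_test)
test_rmse = np.sqrt(mean_squared_error(y_test, reg.predict(X_test)))

print(f"train R2={train_r2:.3f}  test R2={test_r2:.3f}  test RMSE={test_rmse:.3f}")
```

A large gap between the training and test scores is the usual sign of overfitting.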
Polynomial Regression
import pandas as pd
import numpy as np
data = pd.read_csv("D:/NRIT Solutions/ML/Presentation/Linear Regression/Multi Linear Regression/Car Price Prediction Multiple Linear Regression/CarPrice_Assignment.csv")
# Convert categorical columns to integer codes
from sklearn.preprocessing import LabelEncoder
lab_obj = LabelEncoder()
data["fueltype"] = lab_obj.fit_transform(data["fueltype"])
data["aspiration"] = lab_obj.fit_transform(data["aspiration"])
data["doornumber"] = lab_obj.fit_transform(data["doornumber"])
data["carbody"] = lab_obj.fit_transform(data["carbody"])
data["drivewheel"] = lab_obj.fit_transform(data["drivewheel"])
data["enginelocation"] = lab_obj.fit_transform(data["enginelocation"])
data["enginetype"] = lab_obj.fit_transform(data["enginetype"])
data["cylindernumber"] = lab_obj.fit_transform(data["cylindernumber"])
data["fuelsystem"] = lab_obj.fit_transform(data["fuelsystem"])
#Drop unwanted columns
data = data.drop("car_ID",axis=1)
data = data.drop("CarName",axis=1)
#Divide the features into independent and dependent variables in terms of X and Y
x = data.iloc[:, 0:-1]
y = data.iloc[:, [-1]]
#Feature Scaling
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
x_scl = StandardScaler()
y_scl = StandardScaler()
x = x_scl.fit_transform(x)
y = y_scl.fit_transform(y)
# Splitting data into train and test sets
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
# Create polynomial features (degree=2)
poly = PolynomialFeatures(degree=2)
x = poly.fit_transform(x)
# Reuse the fitted transformer on the splits instead of refitting it each time
x_poly_train = poly.transform(x_train)
x_poly_test = poly.transform(x_test)
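# Fit Linear Regression on the polynomial features.
# Note: x here is the full data set (test rows included), so the scores below are optimistic;
# fitting on x_poly_train, y_train would keep the test rows unseen.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x, y)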
LinearRegression()
model.score(x, y)
0.9987438222456763
model.score(x_poly_test, y_test)
0.9995488048241306
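Scores this close to 1 deserve a sanity check. The sketch below counts how many columns a degree-2 expansion produces for this dataset's shape: 23 input features (24 columns after dropping car_ID and CarName, minus price) and 205 rows. The count is computed both from the closed form C(n + d, d) and from `PolynomialFeatures` itself; the all-zeros array is only a placeholder with the right shape.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 23   # feature columns in x
n_rows = 205      # rows in the dataset
degree = 2

# Number of monomials of total degree <= 2 in 23 variables, bias term included.
expected_terms = comb(n_features + degree, degree)

# Cross-check against scikit-learn using a placeholder array of the same shape.
poly = PolynomialFeatures(degree=degree)
poly.fit(np.zeros((n_rows, n_features)))

print(expected_terms, poly.n_output_features_)  # 300 300
```

With 300 columns and only 205 rows, ordinary least squares has enough freedom to fit the rows it was trained on almost exactly, so a near-perfect score on data the model has already seen says little about how it will do on new cars.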
# Make predictions on the test set
y_pred = model.predict(x_poly_test)
from sklearn.metrics import root_mean_squared_error
RMSE = root_mean_squared_error(y_true=y_test,y_pred=y_pred)
RMSE
0.023682050252017823
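The two RMSE values are not on the same scale: the linear section scales `price` with `MinMaxScaler`, while the polynomial section uses `StandardScaler`. A rough way to compare them is to convert both back to price units, sketched below with the minimum, maximum, and standard deviation from the `data.describe()` output. It assumes `StandardScaler` divides by the population standard deviation (ddof=0), whereas `describe()` reports the sample value (ddof=1), hence the small correction factor.

```python
import math

# Price statistics taken from the data.describe() output above.
price_min, price_max = 5118.0, 45400.0
price_sample_std = 7988.852332
n_rows = 205

# RMSE values reported in the two sections (in scaled units).
rmse_linear_scaled = 0.0843969173808874   # target scaled with MinMaxScaler
rmse_poly_scaled = 0.023682050252017823   # target scaled with StandardScaler

# MinMaxScaler divides by (max - min), so multiply back by the price range.
rmse_linear_price = rmse_linear_scaled * (price_max - price_min)

# StandardScaler divides by the population std; convert the sample std first.
price_pop_std = price_sample_std * math.sqrt((n_rows - 1) / n_rows)
rmse_poly_price = rmse_poly_scaled * price_pop_std

# Roughly 3400 for the linear model and roughly 189 for the polynomial model.
print(round(rmse_linear_price), round(rmse_poly_price))
```

Both figures come from models that were fit on all rows, so they are likely to understate the error on genuinely unseen data.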